Continuing from yesterday.
For this part we can simply use the official model_vc.py as-is.
Change lines 90 and 91 of the official solver_encoder.py to:
# The original lines caused a shape mismatch, which made the loss converge too quickly without actually learning
g_loss_id = F.mse_loss(x_real, x_identic.squeeze())
g_loss_id_psnt = F.mse_loss(x_real, x_identic_psnt.squeeze())
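To see why the squeeze() matters: the decoder output x_identic carries an extra singleton dimension, e.g. (batch, 1, T, 80), while x_real is (batch, T, 80). MSE between the two silently broadcasts to a larger shape and ends up pairing frames across different batch entries. A minimal numpy sketch of the broadcasting problem (shapes are illustrative, not taken from the official code):

```python
import numpy as np

x_real = np.random.rand(2, 176, 80).astype(np.float32)  # ground-truth mels
x_identic = x_real[:, None, :, :]                       # output with extra dim: (2, 1, 176, 80)

# Without squeeze, (2, 1, 176, 80) vs (2, 176, 80) broadcasts to (2, 2, 176, 80):
# every sample is compared against every other sample in the batch.
broadcast_shape = np.broadcast_shapes(x_identic.shape, x_real.shape)
print(broadcast_shape)  # (2, 2, 176, 80)

mse_wrong = np.mean((x_identic - x_real) ** 2)             # cross-sample pairs leak in
mse_right = np.mean((x_identic.squeeze(1) - x_real) ** 2)  # shapes line up, exactly 0 here
```

With the squeeze, identical tensors give an MSE of exactly 0; without it, the broadcast drags in cross-sample differences and the loss no longer measures what we want.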
import torch
from solver_encoder import Solver
from data_loader import get_loader

class Config:
    def __init__(self):
        self.data_dir = './spmel'
        self.batch_size = 2
        self.len_crop = 176
        self.lambda_cd = 1
        self.dim_neck = 44
        self.dim_emb = 256
        self.dim_pre = 512
        self.freq = 22
        self.num_iters = 1000000
        self.log_step = 10

config = Config()
vcc_loader = get_loader(config.data_dir, config.batch_size, config.len_crop)
solver = Solver(vcc_loader, config)
solver.train()
torch.save(solver.G.state_dict(), "autovc")
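One thing worth double-checking in the config above: AutoVC downsamples the content codes by a factor of `freq`, so `len_crop` should be a multiple of `freq` (here 176 = 8 × 22), and the bottleneck is 2 × dim_neck because the content encoder is bidirectional. A quick sanity check I like to run before training (my own habit, not part of the official code):

```python
len_crop, freq, dim_neck = 176, 22, 44

# The encoder keeps one code every `freq` frames, so the crop must divide evenly
assert len_crop % freq == 0, "len_crop must be a multiple of freq"

num_codes = len_crop // freq       # content codes per cropped utterance
code_dim = 2 * dim_neck            # forward + backward halves of the BLSTM bottleneck
print(f"{num_codes} content codes of dim {code_dim} per utterance")
```

If the assertion fails, the downsample/upsample steps inside the model will not line up and you will hit shape errors at training time.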
import IPython.display as ipd
import pickle
import torch
import numpy as np
from model_vc import Generator

device = 'cuda:0'
# (dim_neck, dim_emb, dim_pre, freq) must match the training config above
G = Generator(44, 256, 512, 22).eval().to(device)
G.load_state_dict(torch.load('autovc'))

metadata = pickle.load(open('spmel/train.pkl', "rb"))
source = 0
target = 3
# My source speaker is p226
uttr = np.load(f"spmel/p226/p226_014_mic1.npy")[50:226]
# (1, 256)
emb_org = torch.from_numpy(np.expand_dims(metadata[source][1], axis=0)).to(device)
# (1, 256)
emb_trg = torch.from_numpy(np.expand_dims(metadata[target][1], axis=0)).to(device)
# (1, 176, 80)
uttr = torch.from_numpy(np.expand_dims(uttr, axis=0)).to(device)

uttr_trg = None
with torch.no_grad():
    _, x_identic_psnt, _ = G(uttr, emb_org, emb_trg)
    # (176, 80)
    uttr_trg = x_identic_psnt[0, 0, :, :].cpu().numpy()

# To waveform
from interface import *
vocoder = MelVocoder()
audio = np.squeeze(vocoder.inverse(torch.from_numpy(np.expand_dims(uttr_trg.T, axis=0))).cpu().numpy())
ipd.Audio(audio, rate=22050)
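The [50:226] crop above conveniently gives 176 frames, a multiple of freq = 22. For arbitrary-length utterances you would typically zero-pad the mel to the next multiple before calling G, then crop the converted mel back afterwards. A hypothetical helper along those lines (the name and zero padding are my own choices, not from the official repo):

```python
import numpy as np

def pad_to_multiple(mel, base=22):
    """Zero-pad a (T, 80) mel spectrogram so T becomes a multiple of `base`.
    Returns the padded mel and the original length, for cropping the output back."""
    t = mel.shape[0]
    pad = (-t) % base  # frames needed to reach the next multiple of `base`
    mel_padded = np.pad(mel, ((0, pad), (0, 0)), mode='constant')
    return mel_padded, t

mel = np.random.rand(183, 80).astype(np.float32)
mel_padded, orig_len = pad_to_multiple(mel, base=22)
print(mel_padded.shape)  # (198, 80)
```

After inference you would crop the converted mel back to the original length, e.g. x_identic_psnt[0, 0, :orig_len, :], before handing it to the vocoder.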
Your project root should look roughly like this:
root -
- /spmel
- /wavs
- train.ipynb
- make_spec.ipynb (generates spmel)
- make_d_vector.ipynb (generates train.pkl)
- dataloader.py
- model_vc.py
- solver_encoder.py
- MelGAN model files
- interface.py (for MelGAN)
- modules.py (for MelGAN)
- D_VECTOR model files
If you don't want to train it yourself, note that the official pre-trained model works at 16 kHz and uses a different preprocessing pipeline, so its output cannot be converted back with MelGAN.
You can download my 22 kHz version here; it can be converted back to a waveform with MelGAN.
The inference results really do sound just like the samples published on their official demo page (have a listen)!
That wraps up our quick run through the PyTorch version. We'll cover the details more clearly starting tomorrow, when we rebuild it in TF, but by now we've already had a taste of the magic of voice conversion!
I'm simply more used to TF and find it easier to explain with. My friends all tell me TF is a lost cause and this is the PyTorch era, but we must keep the faith!